The most difficult part of this analysis was data cleaning. Files were in various formats and lacked the structure and ease-of-access that a database should provide in an application like this. Because the data was in such poor condition, there are undoubtedly features we were not able to identify or get into a usable format in the time allowed. We see this as our main analytic limitation.
Perhaps the biggest takeaway of this challenge from our perspective is the need to get our data in order as an organization. Hiring/training the most advanced data scientists available will do little good if there is no data for them to work with. Furthermore, if our chief problem is data cleaning/management, we should be hiring/training data engineers before data scientists.
The final merged data format at the unit/month level of granularity is shown below. Several interesting time series plots are displayed at the bottom of this page as a means of basic data exploration. More advanced visualizations are presented on the third tab of this dashboard.
*Double-click on a unit to hide all others
Key takeaway: the best models can predict flight hours by using ONLY the number of CH aircraft.
We trained a range of models, including simple linear regression, support vector machines (SVM), K-nearest neighbors (KNN), random forest, tree regression, and Keras deep learning, each with a number of different sets of independent variables. The SVM algorithm using only the number of CH aircraft was the highest performing.
This is a critical and interesting finding. While we would never argue that the only driver or meaningful predictor of flight hours is number of CH aircraft, we can argue that this attribute has a disproportionate impact on flight hours.
The simple model using the number of CH aircraft is as useful as any other model, even those using many features and/or advanced algorithms, and it could plausibly be used for planning purposes, assuming the decision maker is satisfied with a mean absolute error (MAE) of ~528 hours.
To put this prediction in context, it is important to define the naive model, that is, a reasonable and simple heuristic that a decision maker could easily employ without serious analysis. We defined the naive model as the number of flight hours in the previous year for each month. Using this simple approach, we achieve an MAE of 786 hours, which ranks 24th among all the models we trained (shown in the table). Our best model improves on the naive model's MAE by ~258 hours. Every model should be considered in relation to the naive model to give an idea of real-world utility.
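As a minimal sketch of how such a naive baseline can be scored, the Python below predicts each month's flight hours from the same month one year earlier and computes the MAE. The function names and the toy numbers are hypothetical and chosen only for illustration; they are not our data, code, or results.

```python
def naive_forecast(hours_by_month):
    """Predict each (year, month) from the same month one year earlier.

    hours_by_month: dict mapping (year, month) -> observed flight hours.
    Returns a dict of (year, month) -> (predicted, actual) for every month
    that has an observation twelve months before it.
    """
    pairs = {}
    for (year, month), actual in hours_by_month.items():
        previous = hours_by_month.get((year - 1, month))
        if previous is not None:
            pairs[(year, month)] = (previous, actual)
    return pairs


def mean_absolute_error(pairs):
    """Average of |predicted - actual| over all scored months."""
    errors = [abs(pred - actual) for pred, actual in pairs.values()]
    return sum(errors) / len(errors)


# Hypothetical flight hours for one unit (illustrative values only).
toy_hours = {
    (2016, 1): 1000.0, (2016, 2): 1200.0,
    (2017, 1): 900.0, (2017, 2): 1500.0,
}

scored = naive_forecast(toy_hours)
naive_mae = mean_absolute_error(scored)
print(naive_mae)
```

Any candidate model can be scored on the same months with the same MAE function, which makes the comparison against the naive baseline direct.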
Whether or not our model is actually useful for a decision maker, it should prompt an investigation into what is causing such a strong relationship between number of CH aircraft and flight hours.
Perhaps the most important insight comes from our simplest visualization: there was a big spike in flight hours in 2012. In general, it seems that we should expect fewer flight hours as the years continue. Any model that predicts overall flight hours increasing should be questioned (not necessarily discarded, but at least investigated).
The vertical colored bars represent the number of pilots flying each aircraft each year. The gray bars linking them represent pilots. It is easy to see the turnover by following the lines between the aircraft types.
Key takeaways:
The OH platform has been phased out. You can see where those pilots were relocated over the years.
The UH and AH platforms have been expanded significantly since 2011.
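The link widths in a flow diagram like this one come from counting how many pilots move from one platform to another between consecutive years. The sketch below shows one way to compute those counts; the input layout, the function name, and the pilot records are assumptions made for illustration, not the structure of our data.

```python
from collections import Counter


def platform_flows(assignments):
    """Count pilots moving between platforms in consecutive years.

    assignments: dict mapping pilot id -> dict of year -> platform.
    Returns a Counter keyed by (year, platform_that_year, platform_next_year).
    Pilots who stay on the same platform are counted too.
    """
    flows = Counter()
    for by_year in assignments.values():
        for year, platform in by_year.items():
            next_platform = by_year.get(year + 1)
            if next_platform is not None:
                flows[(year, platform, next_platform)] += 1
    return flows


# Hypothetical pilot records (illustrative only).
toy_assignments = {
    "pilot_1": {2011: "OH", 2012: "AH"},
    "pilot_2": {2011: "OH", 2012: "AH"},
    "pilot_3": {2011: "UH", 2012: "UH"},
}

flows = platform_flows(toy_assignments)
for (year, source, target), count in sorted(flows.items()):
    print(year, source, "->", target, count)
```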
The vertical colored bars represent units over the years 2011 to 2018 while the gray bars between them represent people moving between them.
Key takeaway: The unit composition is fairly stable over the years. We can see that there is a significant amount of mixing between units, but overall, personnel composition is stable year to year.
Social network analysis shows how enlisted/NCO personnel are the “glue” that binds the human dimension of the military
The aviation social network is huge - 20 million edges. Ideally, we would provide an interactive network to promote further exploration, but graphs of this size are too large to reasonably render on-the-fly. We instead used Graphistry to render an image of the full network.
This network visualization shows the interconnectedness of these units. Each node is an individual and a link denotes that two people were in the same unit for 12 months or longer. Blue nodes are enlisted, red nodes are warrant officers, and green nodes are officers.
The size of each node denotes its “importance” as measured by degree centrality. The Warrant and Enlisted nodes appear largest, showing their critical role in forming the Army’s social network.
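To make the link rule and the centrality measure concrete, here is a small sketch under stated assumptions: "in the same unit for 12 months or longer" is interpreted as at least 12 overlapping months in one unit, and degree centrality is normalized as degree divided by (n - 1). The function names, data layout, and people are hypothetical. The pairwise loop is only suitable for toy inputs and would not scale to a network with millions of edges.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(months_in_unit, min_months=12):
    """Link two people who overlapped in one unit for min_months or more.

    months_in_unit: dict mapping person -> dict of unit -> set of month indices.
    Returns a set of (person_a, person_b) tuples with person_a < person_b.
    """
    edges = set()
    for a, b in combinations(sorted(months_in_unit), 2):
        for unit, months_a in months_in_unit[a].items():
            months_b = months_in_unit[b].get(unit, set())
            if len(months_a & months_b) >= min_months:
                edges.add((a, b))
                break  # one qualifying unit is enough for a link
    return edges


def degree_centrality(nodes, edges):
    """Normalized degree centrality: number of links divided by (n - 1)."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(nodes)
    return {node: degree[node] / (n - 1) for node in nodes}


# Hypothetical service records (month indices within one unit).
toy_people = {
    "enlisted_1": {"unit_A": set(range(0, 24))},
    "enlisted_2": {"unit_A": set(range(12, 24))},
    "warrant_1": {"unit_A": set(range(0, 12))},
    "officer_1": {"unit_A": set(range(6, 14))},
}

edges = build_edges(toy_people)
centrality = degree_centrality(list(toy_people), edges)
print(centrality)
```

In this toy example the person who stays longest in the unit collects the most links, which is the same mechanism that makes long-serving personnel appear as large nodes.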
Social networks like this one present a huge opportunity for knowledge management and expert identification. The Army is made up of people and their relationships with each other - the organization cannot know itself without understanding these links.